AITopics | critical batch size

Neural Information Processing Systems http://nips.cc/

batch size, plain sgd, sgd, (14 more...)

Neural Information Processing Systems

Country: North America > Canada > Ontario > Toronto (0.14)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)

Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model

Neural Information Processing Systems, Dec-26-2025, 02:01:13 GMT

Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns. In this work, we study how the critical batch size changes based on properties of the optimization algorithm, including acceleration and preconditioning, through two different lenses: large scale experiments and analysis using a simple noisy quadratic model (NQM). We experimentally demonstrate that optimization algorithms that employ preconditioning, specifically Adam and K-FAC, result in much larger critical batch sizes than stochastic gradient descent with momentum. We also demonstrate that the NQM captures many of the essential features of real neural network training, despite being drastically simpler to work with. The NQM predicts our results with preconditioned optimizers, previous results with accelerated gradient descent, and other results around optimal learning rates and large batch training, making it a useful tool to generate testable predictions about neural network optimization. We demonstrate empirically that the simple noisy quadratic model (NQM) displays many similarities to neural networks in terms of large-batch training. We prove analytical convergence results for the NQM model that predict such behavior and hence provide possible explanations and a better understanding for many large-batch training phenomena.
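The diminishing-returns behavior described in this abstract can be reproduced with a very small simulation. The sketch below is illustrative only: it assumes one common form of a noisy quadratic model, a diagonal quadratic loss with curvatures `h_i` and per-coordinate gradient-noise variance `c_i / B`, and tracks the expected squared parameter values under plain SGD. The spectrum `h_i = 1/i`, the choice `c = h`, the dimension, the learning-rate grid, and the 5% loss target are all assumptions made for the example, not settings taken from the listing.

```python
import numpy as np

def steps_to_target(batch_size, lr, h, c, m0, target, max_steps=3000):
    """Return the first step at which the expected loss is <= target, else None.

    Tracks m_i = E[theta_i^2] for SGD on L(theta) = 0.5 * sum_i h_i * theta_i^2
    when each gradient coordinate carries zero-mean noise of variance c_i / batch_size.
    """
    m = m0.copy()
    decay = (1.0 - lr * h) ** 2          # contraction of the noiseless dynamics
    noise = lr ** 2 * c / batch_size     # variance injected at every step
    for step in range(1, max_steps + 1):
        m = decay * m + noise
        if 0.5 * np.sum(h * m) <= target:
            return step
    return None

def best_steps(batch_size, lrs, h, c, m0, target):
    """Fewest steps to the target over a learning-rate grid (None if never reached)."""
    best = None
    for lr in lrs:
        s = steps_to_target(batch_size, lr, h, c, m0, target)
        if s is not None and (best is None or s < best):
            best = s
    return best

def sweep():
    """Steps-to-target for batch sizes 1, 4, ..., 16384 (all settings are assumptions)."""
    dim = 20
    h = 1.0 / np.arange(1, dim + 1)       # assumed curvature spectrum
    c = h.copy()                          # assumed noise covariance equal to curvature
    m0 = np.ones(dim)                     # E[theta_i^2] at initialization
    target = 0.05 * 0.5 * np.sum(h * m0)  # reach 5% of the initial expected loss
    lrs = np.geomspace(1e-2, 1.9, 16)     # stable because lr * max(h) < 2
    batch_sizes = [4 ** k for k in range(8)]
    return batch_sizes, [best_steps(b, lrs, h, c, m0, target) for b in batch_sizes]

if __name__ == "__main__":
    for b, s in zip(*sweep()):
        print(f"batch size {b:>6}: {s} steps")
```

With the learning rate tuned separately for each batch size, the step count falls quickly at small batch sizes and then flattens toward the noiseless gradient-descent limit, which is the qualitative shape associated with a critical batch size.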

algorithmic choice matter, batch size, name change, (9 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.82)

Convergence Bound and Critical Batch Size of Muon Optimizer

Sato, Naoki, Naganuma, Hiroki, Iiduka, Hideaki

arXiv.org Artificial Intelligence, Nov-24-2025

Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents a theoretical analysis to support its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without the inclusion of Nesterov momentum and weight decay. Our analysis covers the standard configuration using both, thereby elucidating its real-world performance. We then demonstrate that the addition of weight decay yields strictly tighter theoretical bounds and clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive the critical batch size for Muon that minimizes the computational cost of training. Our analysis identifies the hyperparameters governing this value, and our experiments validate the corresponding theoretical findings across workloads including image classification and language modeling tasks.

artificial intelligence, batch size, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2507.01598

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

Abreu, Natalie, Vyas, Nikhil, Kakade, Sham, Morwani, Depen

arXiv.org Artificial Intelligence, Oct-13-2025

Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle. With rising compute requirements for training large language models (LLMs), improving optimization methods has become a central strategy for improving training efficiency. Better optimizers can directly reduce the serial runtime to train an LLM, which is crucial for large-scale models that train from days to months. Optimization for LLMs has traditionally leveraged first-order methods such as SGD and Adam (Kingma & Ba, 2017).

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2510.09378

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model

Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George Dahl, Chris Shallue, Roger B. Grosse

Neural Information Processing Systems, Aug-20-2025, 06:18:34 GMT

Neural Information Processing Systems http://nips.cc/

batch size, plain sgd, sgd, (14 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)

6ceb6c2150bbf46fd75528a6cd6be793-Paper-Conference.pdf

Neural Information Processing Systems, Aug-15-2025, 15:58:47 GMT

batch size, behavior policy, proximal policy, (14 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Jordan (0.05)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.47)

Optimal Growth Schedules for Batch Size and Learning Rate in SGD that Reduce SFO Complexity

Umeda, Hikaru, Iiduka, Hideaki

arXiv.org Artificial Intelligence, Aug-8-2025

The unprecedented growth of deep learning models has enabled remarkable advances but introduced substantial computational bottlenecks. A key factor contributing to training efficiency is batch-size and learning-rate scheduling in stochastic gradient methods. However, naive scheduling of these hyperparameters can degrade optimization efficiency and compromise generalization. Motivated by recent theoretical insights, we investigated how the batch size and learning rate should be increased during training to balance efficiency and convergence. We analyzed this problem on the basis of stochastic first-order oracle (SFO) complexity, defined as the expected number of gradient evaluations needed to reach an $ε$-approximate stationary point of the empirical loss. We theoretically derived optimal growth schedules for the batch size and learning rate that reduce SFO complexity and validated them through extensive experiments. Our results offer both theoretical insights and practical guidelines for scalable and efficient large-batch training in deep learning.
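As a rough illustration of the accounting behind SFO complexity, the sketch below counts gradient evaluations as the sum of the batch sizes used over a run. The geometric schedule (multiply the batch size by `growth` every `period` steps) and the function names are hypothetical choices for the example; the abstract does not specify the schedules the authors derive, and the number of steps needed to reach an approximate stationary point is treated here as a given input rather than computed.

```python
def batch_schedule(step, b0, growth, period):
    """Batch size at `step`: start at b0 and multiply by `growth` every `period` steps."""
    return int(b0 * growth ** (step // period))

def sfo_count(num_steps, b0, growth=1, period=1):
    """Total stochastic gradient evaluations over `num_steps` steps of the schedule."""
    return sum(batch_schedule(t, b0, growth, period) for t in range(num_steps))

if __name__ == "__main__":
    # Constant batch size of 32 for 100 steps.
    print(sfo_count(100, b0=32))
    # Batch size 8, doubled every 10 steps, for 30 steps.
    print(sfo_count(30, b0=8, growth=2, period=10))
```

Comparing two schedules on this measure only makes sense when each is run for the number of steps it actually needs to reach the same stationarity threshold, which is what the theoretical analysis is meant to provide.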

artificial intelligence, batch size, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2508.05297

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model

Neural Information Processing Systems, May-27-2025, 16:56:43 GMT

Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns. In this work, we study how the critical batch size changes based on properties of the optimization algorithm, including acceleration and preconditioning, through two different lenses: large scale experiments and analysis using a simple noisy quadratic model (NQM). We experimentally demonstrate that optimization algorithms that employ preconditioning, specifically Adam and K-FAC, result in much larger critical batch sizes than stochastic gradient descent with momentum. We also demonstrate that the NQM captures many of the essential features of real neural network training, despite being drastically simpler to work with. The NQM predicts our results with preconditioned optimizers, previous results with accelerated gradient descent, and other results around optimal learning rates and large batch training, making it a useful tool to generate testable predictions about neural network optimization.

algorithmic choice matter, batch size, noisy quadratic model, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.86)

Practical Efficiency of Muon for Pretraining

Essential AI: Shah, Ishaan, Polloreno, Anthony M., Stratos, Karl, Monk, Philip, Chaluvaraju, Adarsh, Hojel, Andrew, Ma, Andrew, Thomas, Anil, Tanwer, Ashish, Shah, Darsh J, Nguyen, Khoi, Smith, Kurt, Callahan, Michael, Pust, Michael, Parmar, Mohit, Rushton, Peter, Mazarakis, Platon, Kapila, Ritvik, Srivastava, Saurabh, Singla, Somanshu, Romanski, Tim, Vanjani, Yash, Vaswani, Ashish

arXiv.org Machine Learning, May-21-2025

We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.

large language model, machine learning, natural language, (18 more...)

arXiv.org Machine Learning

2505.02222

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Middle East > Jordan (0.05)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

How Does Critical Batch Size Scale in Pre-training?

Zhang, Hanlin, Morwani, Depen, Vyas, Nikhil, Wu, Jingfeng, Zou, Difan, Ghai, Udaya, Foster, Dean, Kakade, Sham

arXiv.org Machine Learning, Nov-8-2024

Efficient optimization is critical in pre-training large models (LMs) at scale (McCandlish et al., 2018; Shoeybi et al., 2019; Kaplan et al., 2020). In particular, large-batch training is key to accelerating training, as it enables more efficient parallelism across hardware accelerators (You et al., 2017; Goyal et al., 2018). Specifically, understanding the scaling behavior of the critical batch size (CBS) is essential for optimizing data parallelism, as it defines the point beyond which increasing the batch size may degrade computational efficiency. Below the CBS, approximately linear scaling is achievable: doubling the batch size can proportionally reduce the number of optimization steps required to reach a target loss. However, beyond this threshold, further increases in batch size lead to diminishing returns, making it essential to balance computational efficiency with model performance (McCandlish et al., 2018; Shallue et al., 2019). This trade-off presents a challenge for studying pre-training given resource constraints, as practitioners are compelled to navigate difficult decisions in balancing compute, data, and training time. We investigate the scaling laws governing CBS in the context of autoregressive transformer-based language modeling (Vaswani, 2017; Radford et al., 2018). Analyzing CBS in pre-training is challenging due to the absence of a precise formalism relating it to model and data sizes in the literature (McCandlish et al., 2018; Kaplan et al., 2020).
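One way to make the "linear scaling below the CBS, diminishing returns above it" picture concrete is the hyperbolic steps-versus-batch-size relation from McCandlish et al. (2018), in which the number of steps is `s_min * (1 + b_crit / B)`. The sketch below uses that relation purely as an illustration; it is an assumed model, the values of `s_min` and `b_crit` are made up, and the paper listed here may define and measure the CBS differently.

```python
def steps_needed(batch_size, s_min, b_crit):
    """Optimization steps to reach the target loss under the assumed hyperbolic model."""
    return s_min * (1.0 + b_crit / batch_size)

def examples_needed(batch_size, s_min, b_crit):
    """Training examples processed: batch size times number of steps."""
    return batch_size * steps_needed(batch_size, s_min, b_crit)

if __name__ == "__main__":
    s_min, b_crit = 1000, 1024  # hypothetical values for illustration
    for b in (8, 16, 1024, 65536, 131072):
        print(b, steps_needed(b, s_min, b_crit), examples_needed(b, s_min, b_crit))
```

Under this model, doubling the batch size from 8 to 16 cuts the step count by a factor close to 2, while doubling it from 65536 to 131072 changes the step count by under 1% and nearly doubles the number of examples processed. At `B = b_crit`, both steps and examples are exactly twice their respective minimums.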

large language model, machine learning, natural language, (18 more...)

arXiv.org Machine Learning

2410.21676

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.88)
(2 more...)
